Executive Summary

This report describes the process behind the creation of a machine learning model used to classify executions of a weight lifting exercise (the unilateral dumbbell biceps curl) into five classes, A to E.

More about the research and the data used can be found on the following website: http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises

TODO: MODEL RESULT

Exploratory Data Analysis

Our data source URLs:

TRAINING_SOURCE_FILE_URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
TESTING_SOURCE_FILE_URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
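
The loading step refers to TRAINING_FILE_PATH and TESTING_FILE_PATH, which this report does not define. A minimal sketch of how the files could be fetched once and reused, assuming a hypothetical helper `download_if_missing` and a local `data` directory (both are illustrative choices, not taken from the original analysis):

```r
# Hypothetical helper: download a file only when it is not already on disk.
download_if_missing <- function(url, path) {
  if (!file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = path, mode = "wb")
  }
  invisible(path)
}

# Assumed local paths; the report does not show where the files are stored.
TRAINING_FILE_PATH <- file.path("data", "pml-training.csv")
TESTING_FILE_PATH <- file.path("data", "pml-testing.csv")

# Usage (requires network access):
# download_if_missing(TRAINING_SOURCE_FILE_URL, TRAINING_FILE_PATH)
# download_if_missing(TESTING_SOURCE_FILE_URL, TESTING_FILE_PATH)
```

Skipping the download when the file already exists keeps repeated runs of the report fast and avoids depending on the network each time.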

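Loading and splitting the data (training, validation and testing):

library(caret)  # provides createDataPartition

# Treat both literal "NA" and spreadsheet division errors as missing values.
NA_STRINGS <- c("NA", "#DIV/0!")
training <- read.csv(TRAINING_FILE_PATH, na.strings = NA_STRINGS)
testing <- read.csv(TESTING_FILE_PATH, na.strings = NA_STRINGS)

# Hold out 30% of the labelled data for validation, stratified by classe.
in.training <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
validating <- training[-in.training, ]
training <- training[in.training, ]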

What our training dataset looks like:

dim(training)
## [1] 13737   160
print(table(training$classe))
## 
##    A    B    C    D    E 
## 3906 2658 2396 2252 2525

Checking the presence of NAs per variable:

na.stats
## 
## (-0.001,0.05]      (0.95,1] 
##            60           100

A large number of variables (100 of 160) have more than 95% NA values. These variables will be excluded from our models.
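
The computation behind `na.stats` is not shown in this report. One way such a summary could be produced is sketched below on a toy data frame; the helper name `na_fraction`, the toy values and the `NA_THRESHOLD` constant are illustrative assumptions rather than the original code:

```r
# Hypothetical helper: fraction of missing values in each column.
na_fraction <- function(df) colMeans(is.na(df))

# Toy data standing in for the real training set (made-up values).
toy <- data.frame(
  roll_belt = c(1.4, 1.5, 1.3, 1.6),
  max_roll_belt = c(NA, NA, NA, NA),
  classe = c("A", "B", "A", "C")
)

na.fraction <- na_fraction(toy)

# Bin the fractions into 20 equal-width intervals and count columns per bin,
# keeping only the bins that actually contain columns.
na.bins <- table(droplevels(cut(na.fraction, breaks = 20)))

# Columns that are almost entirely NA, using an assumed 95% threshold.
NA_THRESHOLD <- 0.95
almost.empty <- names(na.fraction)[na.fraction > NA_THRESHOLD]
toy.clean <- toy[, !(names(toy) %in% almost.empty), drop = FALSE]
```

Working with the per-column NA fraction rather than raw counts makes the threshold independent of how many rows ended up in the training partition.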

Outliers.

You can see more details about the training dataset in the Appendix.

Models

Remove unnecessary columns

NZV

Outlier removal.

Normalization?

Models

Model selection

Accuracy and Residual Analysis

Prediction

Conclusion

Appendix

Variables that are being ignored:

unwanted.columns
## [1] "X"                    "raw_timestamp_part_1" "raw_timestamp_part_2"
## [4] "cvtd_timestamp"       "new_window"           "num_window"          
## [7] "user_name"
almost.empty.columns
##   [1] "kurtosis_yaw_belt"        "skewness_yaw_belt"       
##   [3] "kurtosis_yaw_dumbbell"    "skewness_yaw_dumbbell"   
##   [5] "kurtosis_yaw_forearm"     "skewness_yaw_forearm"    
##   [7] "kurtosis_picth_forearm"   "skewness_pitch_forearm"  
##   [9] "kurtosis_roll_forearm"    "skewness_roll_forearm"   
##  [11] "max_yaw_forearm"          "min_yaw_forearm"         
##  [13] "amplitude_yaw_forearm"    "kurtosis_picth_arm"      
##  [15] "skewness_pitch_arm"       "kurtosis_roll_arm"       
##  [17] "skewness_roll_arm"        "kurtosis_picth_belt"     
##  [19] "skewness_roll_belt.1"     "kurtosis_yaw_arm"        
##  [21] "skewness_yaw_arm"         "kurtosis_roll_belt"      
##  [23] "skewness_roll_belt"       "max_yaw_belt"            
##  [25] "min_yaw_belt"             "amplitude_yaw_belt"      
##  [27] "kurtosis_roll_dumbbell"   "skewness_roll_dumbbell"  
##  [29] "max_yaw_dumbbell"         "min_yaw_dumbbell"        
##  [31] "amplitude_yaw_dumbbell"   "kurtosis_picth_dumbbell" 
##  [33] "skewness_pitch_dumbbell"  "max_roll_belt"           
##  [35] "max_picth_belt"           "min_roll_belt"           
##  [37] "min_pitch_belt"           "amplitude_roll_belt"     
##  [39] "amplitude_pitch_belt"     "var_total_accel_belt"    
##  [41] "avg_roll_belt"            "stddev_roll_belt"        
##  [43] "var_roll_belt"            "avg_pitch_belt"          
##  [45] "stddev_pitch_belt"        "var_pitch_belt"          
##  [47] "avg_yaw_belt"             "stddev_yaw_belt"         
##  [49] "var_yaw_belt"             "var_accel_arm"           
##  [51] "avg_roll_arm"             "stddev_roll_arm"         
##  [53] "var_roll_arm"             "avg_pitch_arm"           
##  [55] "stddev_pitch_arm"         "var_pitch_arm"           
##  [57] "avg_yaw_arm"              "stddev_yaw_arm"          
##  [59] "var_yaw_arm"              "max_roll_arm"            
##  [61] "max_picth_arm"            "max_yaw_arm"             
##  [63] "min_roll_arm"             "min_pitch_arm"           
##  [65] "min_yaw_arm"              "amplitude_roll_arm"      
##  [67] "amplitude_pitch_arm"      "amplitude_yaw_arm"       
##  [69] "max_roll_dumbbell"        "max_picth_dumbbell"      
##  [71] "min_roll_dumbbell"        "min_pitch_dumbbell"      
##  [73] "amplitude_roll_dumbbell"  "amplitude_pitch_dumbbell"
##  [75] "var_accel_dumbbell"       "avg_roll_dumbbell"       
##  [77] "stddev_roll_dumbbell"     "var_roll_dumbbell"       
##  [79] "avg_pitch_dumbbell"       "stddev_pitch_dumbbell"   
##  [81] "var_pitch_dumbbell"       "avg_yaw_dumbbell"        
##  [83] "stddev_yaw_dumbbell"      "var_yaw_dumbbell"        
##  [85] "max_roll_forearm"         "max_picth_forearm"       
##  [87] "min_roll_forearm"         "min_pitch_forearm"       
##  [89] "amplitude_roll_forearm"   "amplitude_pitch_forearm" 
##  [91] "var_accel_forearm"        "avg_roll_forearm"        
##  [93] "stddev_roll_forearm"      "var_roll_forearm"        
##  [95] "avg_pitch_forearm"        "stddev_pitch_forearm"    
##  [97] "var_pitch_forearm"        "avg_yaw_forearm"         
##  [99] "stddev_yaw_forearm"       "var_yaw_forearm"
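
Dropping the variables listed above amounts to a name-based column filter. A minimal sketch on a toy data frame, assuming a hypothetical helper `drop_columns` (the report does not show how the columns were actually removed, and the toy values are made up):

```r
# Hypothetical helper: remove the named columns, ignoring names that are absent.
drop_columns <- function(df, columns) {
  df[, !(names(df) %in% columns), drop = FALSE]
}

# Toy data reusing a few of the real column names.
toy <- data.frame(
  X = 1:3,
  user_name = c("user1", "user2", "user3"),
  roll_belt = c(1.41, 1.42, 1.48),
  kurtosis_yaw_belt = c(NA, NA, NA),
  classe = c("A", "A", "B")
)

# Shortened stand-ins for the two lists printed above.
unwanted.columns <- c("X", "user_name")
almost.empty.columns <- c("kurtosis_yaw_belt")

toy.clean <- drop_columns(toy, c(unwanted.columns, almost.empty.columns))
```

The same filter would need to be applied to the validation and testing sets so that all three share the same predictors.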

After removing the variables with too many NAs, no near-zero-variance variables remain either:

nzv <- nearZeroVar(training, saveMetrics = TRUE)
print(nzv[nzv$nzv,])
## [1] freqRatio     percentUnique zeroVar       nzv          
## <0 rows> (or 0-length row.names)

Boxplot for each numeric variable per classe:
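
The plots themselves are not reproduced here. A minimal base-R sketch of how such boxplots could be drawn, assuming a hypothetical helper `boxplots_by_class` and a toy data frame with made-up values (the original may well have used a different plotting library):

```r
# Hypothetical helper: draw one boxplot per numeric column, grouped by the
# outcome column, and return the names of the columns that were plotted.
boxplots_by_class <- function(df, outcome = "classe") {
  predictors <- setdiff(names(df), outcome)
  numeric.columns <- predictors[vapply(df[predictors], is.numeric, logical(1))]
  for (column in numeric.columns) {
    boxplot(df[[column]] ~ df[[outcome]],
            main = column, xlab = outcome, ylab = column)
  }
  invisible(numeric.columns)
}

# Toy data to show the call.
toy <- data.frame(
  roll_belt = c(1.4, 1.5, 120, 122, 1.3, 119),
  pitch_belt = c(8.0, 8.1, 25.0, 24.5, 7.9, 26.1),
  classe = factor(c("A", "A", "B", "B", "A", "B"))
)

pdf(NULL)  # discard graphics output; remove this line to see the plots
plotted <- boxplots_by_class(toy)
invisible(dev.off())
```

Points far outside the whiskers in a given class are natural candidates to inspect during the outlier step.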